An Anatomy of PHP's trim Function

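This week I ran into a problem with the trim function. The business needed to compute the number of buyers and display it in a special way: if the count is greater than 10000, show it as xx.x万 (万 means ten thousand); if it is less than 10000, show the raw number. The product team had one more requirement: if the computed value is exactly xx.0, the .0 must be dropped. So at the time I wrote $buyNum = trim(round($buyNum / 10000, 1), '.0'); . For values like 16.0 this works, but for values like 10.0 it does not: because of how trim works, the final result is 1, which is far from what we expected. So what is going on? Let's explore it step by step and understand it thoroughly, so that we do not fall into the same pit again. The PHP source code below is from PHP 7.1.6.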

Let's first look at how the official PHP documentation describes trim:

trim function prototype:

trim ( string $str [, string $character_mask = " \t\n\r\0\x0B" ] ) : string

trim function description:
This function returns a string with whitespace stripped from the beginning and end of str. Without the second parameter, trim() will strip these characters:
- " " (ASCII 32 (0x20)), an ordinary space.
- "\t" (ASCII 9 (0x09)), a tab.
- "\n" (ASCII 10 (0x0A)), a new line (line feed).
- "\r" (ASCII 13 (0x0D)), a carriage return.
- "\0" (ASCII 0 (0x00)), the NUL-byte.
- "\x0B" (ASCII 11 (0x0B)), a vertical tab.
trim parameters:
- str
  The string that will be trimmed.
- character_mask
  Optionally, the stripped characters can also be specified using the character_mask parameter. Simply list all characters that you want to be stripped. With .. you can specify a range of characters.

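The official PHP documentation describes only the basic usage. Without the second parameter, trim strips ' ' (space), \t (horizontal tab), \n (line feed), \r (carriage return), \0 (NUL byte) and \x0B, also written \v (vertical tab). With a second parameter it strips the specified characters instead, and .. can be used to specify a range. Going by that description, what should the following two examples produce?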

<?php
$str = 'Hello World';
$a = trim($str, 'Hdle');
$b = trim($str, 'HdWr');
var_dump($a);
var_dump($b);

One might expect:

string(5) "o Wor"
string(7) "ello ol"

And what do we get when we actually run it?

string(5) "o Wor"
string(9) "ello Worl"

That is quite different from what we expected. What is going on?
Back to my problem of handling the buyer count:

<?php
$buyNum = 99907;
$buyNum = trim(round($buyNum / 10000, 1), '.0');
var_dump($buyNum);
//string(1) "1"

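This is a bit strange, even confusing. What does it mean? Of course the requirement can be met without trim, but why can trim not handle it, or rather, why does it handle it wrongly? Read literally, the official PHP documentation does imply that the result is 1, and it gives no further explanation. Do we stop there? No! We can read the source of trim, get to the bottom of it and truly understand how it is implemented. Then we will not make the same mistake again, and when someone says that trim is hard to use and full of pitfalls, we will know why the pitfall exists and how it arises.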
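
One way to meet the requirement without trim is to format the number with a fixed number of decimals and then remove only a literal ".0" suffix. The following is a minimal sketch, not the code used in the project: the function name formatBuyNum is hypothetical, and treating exactly 10000 like the larger values is an assumption, since the rule described earlier does not say which branch it belongs to.

```php
<?php
// Hypothetical helper for illustration; not from the original project.
function formatBuyNum(int $buyNum): string
{
    // Assumption: exactly 10000 is displayed like the larger values.
    if ($buyNum < 10000) {
        return (string) $buyNum;
    }

    // Always produce exactly one decimal place, e.g. "10.0" or "1.2".
    $wan = number_format($buyNum / 10000, 1, '.', '');

    // Remove the suffix ".0" as a whole, not as a set of characters.
    if (substr($wan, -2) === '.0') {
        $wan = substr($wan, 0, -2);
    }

    return $wan . '万';
}

echo formatBuyNum(99907), "\n";  // 10万
echo formatBuyNum(160000), "\n"; // 16万
echo formatBuyNum(12345), "\n";  // 1.2万
echo formatBuyNum(9999), "\n";   // 9999
```

number_format with explicit separators is used here so that the intermediate string does not depend on how PHP converts a float to a string.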

trim is implemented by the php_trim function in php-7.1.6/ext/standard/string.c. The core code is:

/* {{{ php_trim()
 * mode 1 : trim left
 * mode 2 : trim right
 * mode 3 : trim left and right
 * what indicates which chars are to be trimmed. NULL->default (' \t\n\r\v\0')
 */
PHPAPI zend_string *php_trim(zend_string *str, char *what, size_t what_len, int mode)
{
    const char *c = ZSTR_VAL(str);
    size_t len = ZSTR_LEN(str);
    register size_t i;
    size_t trimmed = 0;
    char mask[256];

    if (what) {
        if (what_len == 1) {
            char p = *what;
            if (mode & 1) {
                for (i = 0; i < len; i++) {
                    if (c[i] == p) {
                        trimmed++;
                    } else {
                        break;
                    }
                }
                len -= trimmed;
                c += trimmed;
            }
            if (mode & 2) {
                if (len > 0) {
                    i = len - 1;
                    do {
                        if (c[i] == p) {
                            len--;
                        } else {
                            break;
                        }
                    } while (i-- != 0);
                }
            }
        } else {
            php_charmask((unsigned char*)what, what_len, mask);

            if (mode & 1) {
                for (i = 0; i < len; i++) {
                    if (mask[(unsigned char)c[i]]) {
                        trimmed++;
                    } else {
                        break;
                    }
                }
                len -= trimmed;
                c += trimmed;
            }
            if (mode & 2) {
                if (len > 0) {
                    i = len - 1;
                    do {
                        if (mask[(unsigned char)c[i]]) {
                            len--;
                        } else {
                            break;
                        }
                    } while (i-- != 0);
                }
            }
        }
    } else {
        if (mode & 1) {
            for (i = 0; i < len; i++) {
                if ((unsigned char)c[i] <= ' ' &&
                    (c[i] == ' ' || c[i] == '\n' || c[i] == '\r' || c[i] == '\t' || c[i] == '\v' || c[i] == '\0')) {
                    trimmed++;
                } else {
                    break;
                }
            }
            len -= trimmed;
            c += trimmed;
        }
        if (mode & 2) {
            if (len > 0) {
                i = len - 1;
                do {
                    if ((unsigned char)c[i] <= ' ' &&
                        (c[i] == ' ' || c[i] == '\n' || c[i] == '\r' || c[i] == '\t' || c[i] == '\v' || c[i] == '\0')) {
                        len--;
                    } else {
                        break;
                    }
                } while (i-- != 0);
            }
        }
    }

    if (ZSTR_LEN(str) == len) {
        return zend_string_copy(str);
    } else {
        return zend_string_init(c, len, 0);
    }
}

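The comment above the function, together with its signature, tells us what the parameters mean:

str: the original string
what: the characters to strip
what_len: the length of what
mode: which side to trim: left (1), right (2), or both (3)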

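How trim processes its input:

Check whether what is set; if it is not, strip the default characters (' \t\n\r\v\0');
Check the length of what and take either the single-character path or the multi-character path;
Use a bitwise AND of mode with 1 and 2 to decide whether to trim the left side and the right side;
With several characters, trim strips in a loop until it meets the first character that is not in the list.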
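
The steps above can be sketched in userland PHP. This is only an illustration of the multi-character path, not the engine's code: the function name trimWithMask is hypothetical, it always trims both sides (mode 3), and it deliberately ignores the .. range syntax.

```php
<?php
// Hypothetical re-implementation for illustration only.
function trimWithMask(string $str, string $chars): string
{
    // One flag per possible byte value, like the 256-byte mask in C.
    $mask = array_fill(0, 256, false);
    for ($i = 0, $n = strlen($chars); $i < $n; $i++) {
        $mask[ord($chars[$i])] = true;
    }

    $start = 0;
    $end = strlen($str);

    // Left side: skip bytes while they are in the mask.
    while ($start < $end && $mask[ord($str[$start])]) {
        $start++;
    }
    // Right side: drop bytes while they are in the mask.
    while ($end > $start && $mask[ord($str[$end - 1])]) {
        $end--;
    }

    return substr($str, $start, $end - $start);
}

echo trimWithMask('Hello World', 'HdWr'), "\n"; // ello Worl
echo trimWithMask('16.0', '.0'), "\n";          // 16
echo trimWithMask('10.0', '.0'), "\n";          // 1
```

The second argument acts as a set of bytes, so its order does not matter and no byte sequence is matched as a whole.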

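Single-character stripping is unambiguous; the confusion is all about multi-character stripping, so that is what we look at here. It is mainly handled by the php_charmask function, defined as follows: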

/* {{{ php_charmask
 * Fills a 256-byte bytemask with input. You can specify a range like 'a..z',
 * it needs to be incrementing.
 * Returns: FAILURE/SUCCESS whether the input was correct (i.e. no range errors)
 */
static inline int php_charmask(unsigned char *input, size_t len, char *mask)
{
    unsigned char *end;
    unsigned char c;
    int result = SUCCESS;

    memset(mask, 0, 256);
    for (end = input+len; input < end; input++) {
        c=*input;
        if ((input+3 < end) && input[1] == '.' && input[2] == '.'
                && input[3] >= c) {
            memset(mask+c, 1, input[3] - c + 1);
            input+=3;
        } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
            /* Error, try to be as helpful as possible:
               (a range ending/starting with '.' won't be captured here) */
            if (end-len >= input) { /* there was no 'left' char */
                php_error_docref(NULL, E_WARNING, "Invalid '..'-range, no character to the left of '..'");
                result = FAILURE;
                continue;
            }
            if (input+2 >= end) { /* there is no 'right' char */
                php_error_docref(NULL, E_WARNING, "Invalid '..'-range, no character to the right of '..'");
                result = FAILURE;
                continue;
            }
            if (input[-1] > input[2]) { /* wrong order */
                php_error_docref(NULL, E_WARNING, "Invalid '..'-range, '..'-range needs to be incrementing");
                result = FAILURE;
                continue;
            }
            /* FIXME: better error (a..b..c is the only left possibility?) */
            php_error_docref(NULL, E_WARNING, "Invalid '..'-range");
            result = FAILURE;
            continue;
        } else {
            mask[c]=1;
        }
    }
    return result;
}

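php_charmask uses a 256-byte mask array to mark every byte that should be stripped. The stripping itself then works just like the single-character case, except that it stops at the first byte that is not in the mask. For our case this means that '.0' is not treated as the suffix ".0" but as the set of the two bytes '.' and '0', so every trailing 0 and . is stripped and 10.0 becomes 1. We can also see how ranges are handled, that is, the .. in the second argument of trim. The code also shows that a .. which does not form a valid increasing range, for example three dots ... in the second argument, makes trim raise a warning (E_WARNING).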
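
The range handling and the warning can be observed from PHP itself. The snippet below is a small sketch with made-up input strings; it installs a temporary error handler only so that the warning text can be collected instead of being printed by PHP.

```php
<?php
// Collect warnings instead of letting PHP print them.
$warnings = [];
set_error_handler(function (int $errno, string $errstr) use (&$warnings): bool {
    $warnings[] = $errstr;
    return true; // mark the warning as handled
});

// Valid increasing range: every byte from '0' to '9' is in the mask.
$digitsStripped = trim('0042abc100', '0..9');

// A bare ".." has no character on its left, so php_charmask warns.
// The '.' itself still ends up in the mask and is stripped.
$bare = trim('..x..', '..');

restore_error_handler();

echo $digitsStripped, "\n";  // abc
echo $bare, "\n";            // x
echo count($warnings), "\n"; // 1
echo $warnings[0], "\n";     // message ends with: no character to the left of '..'
```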
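Now that we understand how trim works internally, let's trace the function with GDB. In the session below trim is applied to the string "10.0" with ".0" as the second argument: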

 gdb php
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/php71/bin/php...done.
(gdb) b php_charmask
Breakpoint 1 at 0x730705: php_charmask. (4 locations)
(gdb) r ~/Code/PHP/trim.php
Starting program: /bin/php ~/Code/PHP/trim.php
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, php_trim (str=0x7ffff1c03560, what=0x7ffff1c03598 ".0",
    what_len=2, mode=3) at /usr/src/php-7.1.6/ext/standard/string.c:829
829                             php_charmask((unsigned char*)what, what_len, mask);
(gdb) s
php_charmask (mask=0x7fffffffaa30 "", len=2, input=0x7ffff1c03598 ".0")
    at /usr/src/php-7.1.6/ext/standard/string.c:751
751             memset(mask, 0, 256);
(gdb) n
752             for (end = input+len; input < end; input++) {
(gdb) p input
$1 = (unsigned char *) 0x7ffff1c03598 ".0"
(gdb) p *input
$2 = 46 '.'
(gdb) n
751             memset(mask, 0, 256);
(gdb)
752             for (end = input+len; input < end; input++) {
(gdb)
751             memset(mask, 0, 256);
(gdb)
752             for (end = input+len; input < end; input++) {
(gdb)
761                             if (end-len >= input) { /* there was no 'left' char */
(gdb)
php_trim (str=0x7ffff1c03560, what=<optimized out>, what_len=<optimized out>,
    mode=<optimized out>) at /usr/src/php-7.1.6/ext/standard/string.c:829
829                             php_charmask((unsigned char*)what, what_len, mask);
(gdb) s
php_charmask (mask=0x7fffffffaa30 "", len=<optimized out>,
    input=0x7ffff1c03598 ".0") at /usr/src/php-7.1.6/ext/standard/string.c:754
754                     if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb)
753                     c=*input;
(gdb) p c
$3 = <optimized out>
(gdb) p *input
$4 = 46 '.'
(gdb) n
754                     if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb) n
758                     } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
(gdb) n
797             size_t len = ZSTR_LEN(str);
(gdb) p str
$5 = (zend_string *) 0x7ffff1c03560
(gdb) p *str
$6 = {gc = {refcount = 0, u = {v = {type = 6 '\006', flags = 2 '\002',
        gc_info = 0}, type_info = 518}}, h = 9223372043238031460, len = 4,
  val = "1"}
(gdb) p len
$7 = 4
(gdb) n
829                             php_charmask((unsigned char*)what, what_len, mask);
(gdb) s
php_charmask (mask=0x7fffffffaa30 "", len=<optimized out>,
    input=0x7ffff1c03599 "0") at /usr/src/php-7.1.6/ext/standard/string.c:754
754                     if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb)
753                     c=*input;
(gdb)
754                     if ((input+3 < end) && input[1] == '.' && input[2] == '.'
(gdb)
758                     } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
(gdb)
781                             mask[c]=1;

(gdb) p c
$8 = 48 '0'
(gdb) n
752             for (end = input+len; input < end; input++) {
(gdb)
php_trim (str=0x7ffff1c03560, what=<optimized out>, what_len=<optimized out>,
    mode=<optimized out>) at /usr/src/php-7.1.6/ext/standard/string.c:831
831                             if (mode & 1) {
(gdb)
797             size_t len = ZSTR_LEN(str);
(gdb) n
831                             if (mode & 1) {
(gdb) n
832                                     for (i = 0; i < len; i++) {
(gdb) n
833                                             if (mask[(unsigned char)c[i]]) {
(gdb) n
839                                     len -= trimmed;
(gdb) n
840                                     c += trimmed;
(gdb) n
839                                     len -= trimmed;
(gdb) n
842                             if (mode & 2) {
(gdb) n
843                                     if (len > 0) {
(gdb) n
846                                                     if (mask[(unsigned char)c[i]]) {
(gdb) p c
$10 = 0x7ffff1c03578 "10.0"
(gdb) p c[i]
$11 = 48 '0'
(gdb) p i
$12 = 3
(gdb) n
851                                             } while (i-- != 0);
(gdb) n
846                                                     if (mask[(unsigned char)c[i]]) {
(gdb) p c[i]
$13 = 46 '.'
(gdb) p i
$14 = 2
(gdb) n
851                                             } while (i-- != 0);
(gdb) n
846                                                     if (mask[(unsigned char)c[i]]) {
(gdb) p c[i]
$15 = 48 '0'
(gdb) n
851                                             } while (i-- != 0);
(gdb) n
846                                                     if (mask[(unsigned char)c[i]]) {
(gdb) p c[i]
$16 = 49 '1'
(gdb) p i
$17 = 0
(gdb) n
883             if (ZSTR_LEN(str) == len) {
(gdb) p len
$18 = 1
(gdb) n
886                     return zend_string_init(c, len, 0);
(gdb) p c
$19 = 0x7ffff1c03578 "10.0"
(gdb)
$20 = 0x7ffff1c03578 "10.0"
(gdb) p *c
$21 = 49 '1'

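This confirms our conclusion and deepens our understanding of trim. One more point: trim strips byte by byte, so stripping Chinese characters can produce garbled output. In UTF-8 a Chinese character usually takes 3 bytes, and trim may remove only some of the bytes of a character. Knowing how a function is implemented, down to the details, helps us avoid many pitfalls.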

>>> trim('品、', '、')
=> b"å“"
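
The byte-level effect shown above can be made visible with bin2hex. The sketch below also shows one possible character-aware alternative based on preg_replace with the /u modifier; this alternative is a suggestion of this write-up, not something taken from the PHP source, and the variable names are arbitrary.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) always produce UTF-8 bytes,
// regardless of the encoding of this source file.
$str = "\u{54C1}\u{3001}"; // 品、
$sep = "\u{3001}";         // 、

// Byte-wise trim: the mask holds the three bytes e3, 80 and 81,
// and the last byte of 品 (81) happens to be one of them.
$broken = trim($str, $sep);
echo bin2hex($str), "\n";    // e59381e38081
echo bin2hex($broken), "\n"; // e593

// Character-aware alternative: with /u the pattern matches whole
// code points, so 品 is left intact.
$quoted = preg_quote($sep, '/');
$clean = preg_replace('/^(?:' . $quoted . ')+|(?:' . $quoted . ')+$/u', '', $str);
echo bin2hex($clean), "\n";  // e59381
```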